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Abstract. We use the graphics processing unit (GPU) for fast calculations of helicity amplitudes of physics 
processes. As our first attempt, we compute uu ^ 717 (n = 2 to 8) processes in pp collisions at ^ys = 14TeV 
by transferring the MadGraph generated HELAS amplitudes (FORTRAN) into newly developed HEGET 
(HELAS Evaluation with GPU Enhanced Technology) codes written in CUDA, a C-platform developed 
by NVIDIA for general purpose computing on the GPU. Compared with the usual CPU programs, we 
obtain 40-150 times better performance on the GPU. 
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1.1 Physics motivations 

The field of particle physics is about to enter a new era 
of experimental discovery at the Large Hadron Collider 
(LHC) which will start collecting data this year. Typical 
new physics signals at the LHC are expected to have many 
high pT jets, 7's, W^'s and Z's, and it is very important to 
estimate the Standard Model (SM) background to all the 
signals reliably. Computation of multiple particle produc- 
tion amplitudes is, however, often time consuming, and it 
is desirable to have tools that allow us to evaluate cross 
sections efiiciently. We examine here the possibility of us- 
ing a GPU (Graphic Processing Unit) to compute helicity 
amplitudes at many phase space points simultaneously, by 
taking advantage of its parallel processing capability. 



1.2 GPU hardware and software 

The GPU is a specialized integrated circuit (IC) used in 
the computer system which outputs complex images onto 
displays very fast. This IC is composed of many multi- 
processors and can process large amounts of image data 
in parallel with high efficiency. Because of this high perfor- 
mance CPUs are widely used in the scientific applications 
where a large number of calculations must be performed 
to process the data. 



Recently NVIDIA yy introduced the software devel- 
opment system, CUDA ppl . which enables one to develop 
programs which can be executed on the GPU using C/C-l— 1-. 
We use this system to develop a new helicity amplitude 
calculation package which can be used on a GPU. 



2 Physics process 

As our first attempt we compute uu — > 717 processes in 
pp-collisions, because the amplitudes are particularly sim- 
ple for numerical codes like HELAS [3j such that large 
numbers of Feynman diagrams, beyond the capability of 
automatic diagram generators like MadGraph can be 
computed without difficulty [5]. 



2.1 nj production in pp collisions 

The cross section for producing n high pT photons in pp 
collisions can be expressed in the leading order of pertur- 
bative QCD as 



da'' 



1 



J J dxidX2 [Dq/p {X1,Q) Dqlp (X2, Q) 



+ Dq/p (Xl, Q) Dq/p {X2,Q)\ [s] y) ,(1) 

where the factorization scale Q for the q and q distribution 
functions is chosen as the pt cut-off, and the qq n^^s 
process cross section is 
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Here 



2.2 Final state cuts 



SX1X2 , 



(3) 



and (7i{i = 1,2) denotes the helicity of the initial u and 
u, Xi{i — 1 to n) is that of the final photons. The n-body 
phase space is 



d$„ = (271)^(54 / ^+P2- ^ A:, m 



Lj^ (27r)3 2w, 



, (4) 



where coi — \ki\ for massless photons and 1/n! is the statis- 
tical factor for n identical photons. The helicity amplitude 
for the process 

q{pi,ai)+q{p2,a2) 7(^1; ^i)+7(fe) ^2)H l-7(fc„, A„) 

(5) 

is particularly simple: 

M^\]^-^'^" = M^l'-^-^ + (n! - 1) permutations , (6) 

where the full amplitude is obtained by (n! — 1) permuta- 
tion of one amplitude, denoted e.g. by the Feynman dia- 
gram at Fig. [TJ 



Al, 



tn-1 



u(pi,cri) 



^(fc„,A„)(eQ,)", (7) 



with 



i=l 



iPi - Kif . 



(8) 



Here Qq is the electric charge of the quark in units of e, 
the proton charge. 

The standard version of MadGraph/MadEvent [HISl 
[7] generates HELAS amplitudes and computes the cross 
section for the 717 production process up to n = 6 with- 
out difficulty. The HELAS amplitude for n = 7 with 7 ! 
diagrams can be generated by MadGraph after enlarging 
the maximum size of several variables. We are unable to 
make MadGraph generate 8 ! ~ 4 x 10'* diagrams for n = 8 
photons, so we evaluate them by coding the amplitude of 
eq. ^ and by summing over (8! —1) permutations of the 
8 photon momenta in a numerical program. 




Pi I'l 

Fig. 1. One of n! diagrams for qq — > 717. 



In order to simulate realistic LHC experiments, we intro- 
duce final state cuts for all the observed photons as follows: 



PTi > = 20 GeV, 

< 77^"' = 2.5, 

ARij > Zii?"'* = 0.4, 
where r]i is the rapidity of the i-th photon and 



(9) 



(10) 



measures rapidity and azimuthal angle separation between 
two photons. 

For the sake of simplicity and for the purpose of the 
present paper, we consider only one subprocess, uu 
nj's, for the ri-photon production processes at the LHC, 
by neglecting contributions from dd, ss, cc and bb colli- 
sions. 



3 Computation on the GPU 

On the GPU, a number of programs can be executed with 
its multi-processors in parallel. These programs which run 
concurrently must be the same for all multi-processors 
with different input data for each program. In order to 
utilize the high performance of the GPU for computing 
the n-photon production processes, we assign a program 
which processes one event on each processor. 

If the program is given a set of random numbers, it 
generates a phase space point and computes a Jacobian 
and a squared amplitude for the phase space point inde- 
pendently from the other programs. By keeping the in- 
dependence of programs among processors their structure 
can become simple, and we can take full advantage of the 
highly parallel computation on the GPU. 

In this section we describe our software developed for 
the computation of the n-photon productions in pp colli- 
sions on the GPU and its computation environment. 



3.1 Structure of the program 

For n-photon production processes our program generates 
momenta and helicities of photons in the final state, those 
of u and u in the initial state, and computes the total cross 
section of the physics process in the following order: 

1. initialization of the program, 

2. random number generation on the CPU, 

3. transfer random numbers to the GPU, 

4. generate helicities and momenta of u, u and ri7's using 
random numbers, and compute squared amplitudes on 
the GPU, 

5. transfer the momenta and helicities of external parti- 
cles, as well as computed squared amplitudes to the 
CPU, and 
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Table 1. Parameters of CPUs 





GeForce 


GeForce 


GeForce 




GTX280 


9800GTX 


8800M GTS 


Number of 


30 


16 


8 


multiprocessor 








Number of core 


240 


128 


64 


Total amount of 


1000 


500 


500 


global memory [MB] 








Total amount of 


64 


64 


64 


constant memory [kB] 








Total amount of shared 


16 


16 


16 


memory per block [kB] 








Total number of registers 


16 


8 


8 


available per block [kB] 








Clock rate [GHz] 


1.30 


1.67 


0.40 



6. sum up all values to obtain the total cross section and 
distributions on the CPU. 

Program steps between the generation of random num- 
bers (2) and the summation of computed cross sections (6) 
are repeated until we obtain enough statistics for the cross 
section and distributions. 

For the computation of physics quantities for the n- 
photon production processes, we prepare two types of am- 
plitude programs. They are: 

— a program converted from the FORTRAN program ob- 
tained with the MadGraph |4J, and 

~ a handwritten CUDA program using permutations of 
final state photons, via eq. (O . 

Both programs are written using the HEGET (HELAS 
Evaluation with GPU Enhanced Technology) functions, 
newly developed codes, written in CUDA [2 , to compute 
helicity amplitudes a la HELAS . 

Sample codes for the case of the 3-photon production 
process are shown in [Appendix A| as List [T] and List [2] for 
the converted MadGraph amplitude and the permutation 
amplitude, respectively. They compute a sum of ampli- 
tudes for all diagrams, ampsum, a complex number, from 
the given momenta, {pi, p5} and helicities, {nhl, 

. . . , nh5} of the external particles. In the code for the 
permutation amplitude, List [21 a function, iPNext, gen- 
erates a set of integers for the next permutation in the 
sequence. For all computation of amplitudes, new data 
type "cmplx", defined in List 1161 for complex numbers on 
GPU is used. 



A unit program, called a thread, which is executed on 
a single processor, calculates the amplitudes for one event. 
In the CUDA programming model a set of threads forms 
a thread block. Threads within one thread block can share 
data through the shared memory and cooperate among 
themselves. In the current architecture of NVIDIA's CPUs 
a thread block can contain up to 512 threads. The size of 
a thread block can be changed within this limit when the 
program is executed on a GPU, and we optimize it to 
obtain the best performance of our program. 

With a single call of a kernel, multiple thread blocks 
are executed in parallel on a GPU. A set of thread blocks, 
which is executed with a single kernel call, is called a grid. 
Even within a grid, threads in different thread blocks can- 
not share data through the shared memory. 



Table 2. Configuration of the host PC 





Linux PC 


iMac 


CPU 


Core2Duo 3GHz 


Core2Duo 3.06GHz 


L2 Cache 


6MB 


6MB 


Memory 


4GB 


2GB 


Bus Speed 


1.333GHz 


1.07GHz 


OS 


Fedora 8 (64 bit) 


MacOS X 10.5.5 


GPU 


GTX280 & 
9800GTX 


8800M GTS 



3.2 Execution of a CUDA function on GPU 

Once a set of random numbers is generated on the CPU, 
computations from the generation of phase space points to 
the calculation of squared amplitudes are done by calling 
a CUDA function which is executed on the GPU. A func- 
tion, which is called from a CPU program and is executed 
on a GPU, is called a kernel. In our program, a single call 
of a kernel function computes the scattering amplitudes 
at multiple phase-space points in parallel on a GPU. 



3.3 Host PC environment 

We tested our program on three different CPUs by NVIDIA: 
the GeForce GTX280, 9800GTX and 8800M GTS. The pa- 
rameters for these three CPUs are given in Table [TJ The 
GeForce 9800GTX with 128 processors was introduced in 
April/2008 by NVIDIA as a high-end single GPU graphic 
card for the high performance output of complex images 
like 3D to the PC display. The GeForce GTX280 with a 
new processor architecture, introduced in June/2008, has 
240 processors, 1GByte global memory and 16k registers. 
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Table 3. HEGET functions for external lines 



External Line 


HEGET Function 


HELAS Subroutine 


Flowing-In Fermion 
Flowing-Out Fermion 


ixxxxO, ixxxxl, ixxxx2 
oxxxxO, oxxxxl, oxxxx2 


IXXXXX 
OXXXXX 


Vector Boson 


vxxxxO, vxxxxl, vxxxx2 


VXXXXX 



Table 4. List of the HEGET vertex functions 



Vertex 


Inputs 


Output 


HEGET Function 


HELAS Subroutine 


FFV 


FFV 
FF 
FV 


Amplitude 
V 
F 


iovxxx 
j ioxxO 
fvixxO, fvoxxO 


IOVXXX 
JIOXXX 
FVIXXX, FVQXXX 



which are about twice those of 9800GTX, with a 20% 
slower clock rate. A GeForce 8800M GTS is installed on 
an Apple iMac, and has 64 processors and the clock rate 
of 0.4GHz, but still has the same memory and register size 
as the 9800GTX. 

Parameters for the two host PCs are summarized in 
Table [2j They have comparable capability, although the 
Linux PC has twice larger memory size and 25% faster 
bus speed. 



4 HEGET functions 

In this section, we describe a set of CUDA functions which 
can be used on a GPU for performing helicity ampli- 
tude calculations. Based on the FORTRAN version of 
the HELAS library [3] we developed a set of functions, 
HEGET, written in CUDA. 

These functions are directly converted from the HELAS 
subroutines into C code. We kept the order of the ar- 
guments of the HELAS subroutines but parameters for 
masses and widths were removed from the argument list. 

The HEGET functions are organized to maximize the 
performance on the GPU. Since conditional branches with- 
in programs using "if" statements reduces the total effi- 
ciency of parallel computing, we eliminated them as much 
as possible in the HEGET functions. Accordingly, we pre- 
pared separate HEGET codes for the computation of mas- 
sive and massless wave functions. In this paper we only use 
the HEGET functions for massless particles. 

By the same token, we introduce three types of func- 
tions for external wave functions; "1" for particles moving 
in the +z direction, "2" for particles moving in the —z 
direction and "0" for particles with momentum, p, along 
a generic direction. These numbers are appended to the 
name of the corresponding functions. 

All functions relevant to this paper are listed in Ta- 
ble [3] and Tabled! which include all functions for massless 
fermions and massless vector bosons (photons), respec- 
tively. The naming scheme for HEGET functions follow 
that of HELAS subroutines: the HEGET (HELAS) func- 
tion names start with i(I) and o(0) for flow- in and flow- 
out fermion wave functions, respectively, v(V) for vector 
boson wave functions, f (F) for off-shell fermions and j ( J) 
for off-shell vector bosons. The correspondence between 



the HEGET functions and the HELAS codes are shown 
explicitly in Tables [3] and H) 

All of the HEGET functions that are used in this re- 
port are listed in [Appendix B| as List [3] to List [151 Below, 
we explain each function. 



4.1 Fermion wave functions 

We have two types of functions to compute external fermi- 
ons. One is for "flowing-In" fermions and the other is for 
"flowing-Out" fermions. 

For the flowing-In spinor wavefunction of a fermion, 
we prepare three functions, ixxxxl (List[3|), ixxxx2 (List 
|4]) and ixxxxO (List [5]), derived from the HELAS sub- 
routine, IXXXXX. These three functions compute massless 
fermion wavefunctions with the flowing-In fermion num- 
ber, ixxxxl is for a fermion with momentum along the 
+z direction, ixxxx2 is for a fermion with momentum 
along the —z direction, and ixxxxO is for a fermion with a 
generic three-momentum, p. All functions have the same 
argument list: 

ixxxxfc (float* p, int nHEL, int nSF, cmplx* fi) 

(11) 

for fc = 1, 2 and 0. The inputs and outputs are: 
Inputs: 

float p[4] 4-momentum 

int nHEL twice fermion helicity (-1 or 1) 

int nSF +1 for particle, -1 for anti-particle 



(12) 



Outputs: 

cmplx fi[6] fermion wavefunction |fi> 
'u(p,nHEL/2) for nSF = +1 
w(p,nHEL/2) for nSF = -1 



For the flowing-Out spinor wavefunction of a fermion, 
we also prepare three functions, oxxxxl (List [6]), oxxxx2 
(List [T]) and oxxxxO (List [S]), derived from the HELAS 
subroutine, QXXXXX. 

These three functions compute a massless fermion wave- 
function with a flowing-Out fermion number, oxxxxl is for 
a fermion with momentum along the +z direction, oxxxx2 
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is for a fermion with momentum along — z direction, and 
oxxxxO is for a final state fermion with a generic three- 
momentum, p. All functions have the same argument list: 



oxxxxfc (float* p, int nHEL, int nSF, cmplx* fo) 

(13) 

for fc = 1, 2 and 0. The inputs and outputs are: 
Inputs: 

float p[4] 4-momentum 
int nHEL twice fermion helicity (-1 or 1) 
int nSF +1 for particle, -1 for anti-particle 



(14) 



Outputs: 

cmplx fo[6] fermion wavefunction <fo I 
u(p,nHEL/2) for nSF = -fl 
U(p,nHEL/2) for nSF = -1 



4.2 Vector boson wave functions 

For the wavefunction of a massless vector boson, we pre- 
pare three functions, vxxxxl (List [9]), vxxxx2 (List [TOl) 
and vxxxxO (List [TT|) , derived from the HELAS subrou- 
tine, VXXXXX. 

These three functions compute a massless vector bo- 
son wavefunction from its four-momentum and helicity. 
vxxxxl is for a vector boson with momentum along the 
+z direction, vxxxx2 is for a vector boson with momentum 
along —z direction, and vxxxxO is for a vector boson with 
a generic three-momentum, p. Their argument list is: 

vxxxxfc (float* p, int nHEL, int nSV, cmplx* vc) 

(15) 

for /c = 1, 2 and 0. The inputs and outputs are: 
Inputs: 

float p [4] 4-momentum 

int nHEL helicity of massless vector (-1 or 1) 
int nSV +1 for final, -1 for initial vector 



Outputs: 

cmplx vc [6] vector boson wavefunction 
e''(p,nHel)* for nSV = +1 
e''(p,nHel) for nSV = -1. 



4.3 FFV: fermion-fermion-vector boson vertex 

For the fermion-fermion-vector boson vertex, 



(16) 



^FFV = ll^Fl^ 



gal[0]i-2^+gal[l]i±^ 



^fv; (17) 



there are four functions, iovxxx, fvixxO, fvoxxO and 
jioxxO, in HEGET. They correspond to HELAS subrou- 
tines, IDVXXX, FVIXXX, FVOXXX and JIDXXX, respectively, 
for massless particles. 



4.3.1 iovxxx 

The function iovxxx (List [12]) computes the amplitude of 
the FFV vertex from a flowing-In fermion, a flowing-Out 
fermion and a vector boson wave functions, whether they 
are on-shell or off-shell. It has the arguments: 



iovxxx(cmplx* fi, cmplx* fo, cmplx* vc , 
float* gal, cmplx vertex) 

where the inputs and the outputs are: 



(18) 



Inputs: 

cmplx f i [6] flowing-ln fermion wavefunction 
cmplx fo[6] flowing-Out fermion wavefunction 
cmplx vc [6] vector wavefunction 
float gal [2] coupling constants of FFV vertex (19) 

Outputs: 

cmplx vertex amplitude of the FFV vertex 

<fo|V|fi> 



The two chiral couplings of eq. p9|) are: 

gal [0] = gal [1] = ~eQq (20) 

for the qq^ vertex in QED, where Qq is the electric charge 
of the quark in units of the proton charge, e = y/Ana. 

4.3.2 f vixxO 

The function f vixxO fList. [T5|) computes the off-shell mass- 
less fermion wavefunction from a flowing-ln external fermi- 
on and a vector boson. It has the arguments: 



fvixxO (cmplx* fi, cmplx* vc, float* gal, 

cmplx* fvi) 

where the inputs and the outputs are: 



(21) 



Inputs: 

cmplx f i [6] flowing-ln fermion wavefunction 

cmplx vc [6] vector wavefunction 

float gal [2] coupling constants of the FFV vertex 

Outputs: 

cmplx fvi [6] off-shell fermion wavefunction 
If ,vc,fi> 



(22) 



4.3.3 fvoxxO 



The function fvoxxO (List. [H|) computes the off-shell mass- 
less fermion wavefunction from a flowing-Out external fermi- 
on and a vector boson. It has the arguments: 



fvoxxO (cmplx* fo, cmplx* vc, float* gal, 

cmplx* fvo) 



(23) 
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where the inputs and the outputs are: 
Inputs: 

cmplx fo[6] flowing- Out fermion wavefunction 

cmplx vc [6] vector wavefunction 

float gal [2] coupling constants of the FFV vertex 

Outputs: 

cmplx f vo [6] off-shell fermion wavefunction 
<fo,vc,f ' I 

(24) 



find the results obtained by the HEGET functions agree 
with those from the other programs within the statistics 
of generated number of events. 

Up to ri = 5, MadGraph/MadEvent gives the total 
cross sections and distributions, and we find agreements 
among all the programs. As for n = 6 and 7, the MadGraph 
generated HELAS FORTRAN codes can be integrated by 
using BASES [8, , and the results agree well with those 
of HEGET codes. For n = 8, the FORTRAN program 
that performs (8! — 1) permutations of eq. ([6|) was used to 
compute the matrix elements, which were integrated by 
BASES. 



4.3.4 jioxxO 

This function, jioxxO fList fTS]) computes the off-shell vec- 
tor wavefunction from an external fermion pair. The off- 
sheU vector boson wavefunction is given in the Feynman 
gauge for massless vectors (photons and gluons in the SM). 
It has the arguments: 



jioxxO (cmplx* fi, cmplx* fo, float* gal, 

cmplx* jio) 

where the inputs and the outputs are: 



(25) 



Inputs: 

cmplx f i [6] flowing-In fermion wavefunction 
cmplx fo[6] flowing-Out fermion wavefunction 
float gal [2] coupling constants of the FFV vertex 



Outputs: 

cmplx jio[6] vector current j'^ (<fo|V|fi>) 



(26) 



5.2 Kinematical distributions 

We also compare several kinematical distributions of the 
generated events. As an example, in Fig. [2] distributions 
of the maximum transverse momentum and AR between 
final state photons for the uu — > 5-photons process are 
shown for the three programs. We find excellent agree- 
ment. 




Fig. 2. Distributions of maximum transverse momentum and 
AR of final state photons in uu — > 5-photons 



5 Validation of the HEGET functions 



In order to validate the HEGET functions we compare 
the total and differential cross sections of n-photon pro- 
duction processes computed on the GPU with the HEGET 
functions with those calculated by other programs which 
are based on the FORTRAN version of the HELAS li- 
brary. We use MadGraph/MadEvent [4] and an indepen- 
dent FORTRAN program which calculates total cross sec- 
tion and kinematical distributions with the Monte Carlo 
integration program BASES [8] as references. 

For comparison, we use the same physics parameters 
as MadGraph/MadEvent for all programs. As the parton 
distribution function, we use CTEQ6L1 [9] and set the 
factorization scale to be the pt cut of values, Q = p™' = 
20GeV; see eq. ^. 

5.1 Total cross sections 

For the calculation of the n-photon production cross sec- 
tions, the same final state cuts for all the observed photons 
are apphed; see (eq. (O in Sec. l2.2p . Results for the compu- 
tation of the total cross sections are listed in Table [5l We 



Similar comparisons have been performed for all the cases 
listed in Table [3 and we find agreement within statistical 
errors. 



6 Performance comparison 

6.1 Register allocation and size of the thread block 

Event process time is measured by the standard C library 
function on the CPU. We measured the time between the 
start of the transfer of random numbers to the GPU and 
the end of the transfer of computed results back to the 
CPU. The process time for one event is obtained by di- 
viding the measured time by the total number of events. 
We also prepared a CPU program of the same structure 
and measured the event process time for the equivalent 
part of the program. The total number of events used for 
the measurement depends on the number of photons in 
the final state and the specifications of the GPU. For the 
GTX280, we generate IM events for a single kernel call 
and repeated 100 calls of the kernel for processes with 
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Table 5. Total cross sections [fb] for uu wy at the LHC. 



Number of photons 


HEGET 


Bases 


MadGraph /MadEvent 






2 


1.0822 ± 0.0056 


1.08265 ± 0.00031 


1.0811 ±0.0019 


xlO-* 




3 


6.7776 ± 0.0082 


6.7849 ± 0.0051 


6.775 ±0.031 


xlO" 




4 


1.2400 ± 0.0030 


1.2280 ±0.0029 


1.2372 ±0.0041 


xlO^ 


2 


5 


2.598 ±0.011 


2.596 ±0.010 


2.572 ± 0.053 


xlO" 


5 


6 


5.799 ± 0.017 


5.792 ±0.014 




xlO" 


8 


7 


1.264 ± 0.008 


1.261 ±0.004 




xlO" 


10 


8 


2.77 ±0.05 


2.4 ±0.3 




xlO" 


13 



n-y < 5. In short, we generated lOOM events for a single 
execution of the program. 

When we compile CUD A programs, the maximum num- 
ber of registers allocated to a thread can be specified. How- 
ever, the total number of registers available per thread 
block is limited. If we allocate more registers to a thread, 
then the maximum number of threads in a thread block, 
the block size, must become smaller. We find the perfor- 
mance of the program depends on these two parameters, 
the number of registers in a thread and the number of 
threads in a block. Hence, we made a systematic study of 
the performance of these parameters. 

The basic unit of the GPU processor is a multiproces- 
sor called the Streaming Multiprocessor (SM). Threads 
which belong to one thread block are concurrently exe- 
cuted on one SM. One SM consists of eight Scalar Proces- 
sors (SP) and four threads can be executed on one SP at 
the same time. One multiprocessor can therefore process 
a group of 32 threads, called a warp, in parallel. 

The number of threads in a thread block of a kernel 
execution can be set arbitrarily up to a maximum number 
limited by the number of registers per thread, but the 
performance is found to be better when it is a multiple of 
the warp size, i.e. 32 threads. This is because otherwise 
there are idle SPs during the kernel execution. 

Fig. [3] shows the dependence of the event process time 
for the 5-photon production process, uu 67, on the size 
of thread blocks and the number of allocated registers per 
thread on GTX280. GTX280 has 16k registers, which are 
split into threads in a thread block. Three cases of num- 
bers of registers per thread, 42, 64 and 124, are plotted in 
Fig. [31 which correspond to the maximum available num- 
bers of threads in a multiple of 32 threads, 384, 256 and 
128, respectively. 

The dependence on the blocks size shows a clear peri- 
odicity of a multiple of 32 threads. The best performance 
for the 57 production process was observed for a combina- 
tion of 64 registers per thread and 256 threads in a thread 
block. 



6.2 Comparison of the event process time 

In Fig. m the measured process time for one event of 
n-photons production processes is shown for the GPU 
(GTX280) and the CPU (PC Linux with Fedora 8). They 
are plotted versus the number of photons in the final state. 
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Fig. 3. Dependence of event process time on block size (the 
number of threads per thread block) and the number of regis- 
ters per thread on a GTX280. 



The upper two lines show the event process time for 
the computation on the CPU. The programs with the 
MadGraph generated HELAS codes run faster than those 
based on the permutation amplitude for larger number of 
photons in the final state (n.^ >5). This is essentially be- 
cause the MadGraph-gencrated codes avoid repetition of 
computing the same off-shell amplitudes. A larger frac- 
tion of off-shell amplitudes is repeated for larger n^, and 
the ratio of the process time for the MadGraph gener- 
ated amplitudes and the permutation amplitudes grows 
from 1.7 for = 5 to 3.8 for = 7. For = 8, only the 
permutation-based program can be compiled on the CPU. 

The lower three lines show the event process time on 
the GPU (GTX280). In all cases the program on the GPU 
computes one event much faster than the CPU. For the 
computation on the GPU the HEGET programs obtained 
by converting the HELAS codes generated by the Mad- 
Graph runs faster than those based on the permutations 
of a single amplitude, when — 3 or larger. The ratio 
grows to a factor of 2.3 for n-y = 5. 

When the number of photons becomes larger than five, 
the MadGraph amplitude (the HEGET code which is con- 
verted from the HELAS code generated by MadGraph) 
becomes very long, and the present CUD A compiler can- 
not process the converted program. In order to compile the 
long HEGET program for large n-photon processes we di- 
vide the program into smaller pieces. Each piece computes 
a subset of the diagrams. The CPU program calls separate 
kernels for each amplitude sequentially and sums up their 
computed amplitudes. This method works for n — 7, or 
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7! = 5040 diagrams, whose HEGET code can be compiled 
by dividing it into about 500 kernel calls. The 8-photon 
production process has so many diagrams (8! — 40320), 
that we have not been able to compile the HEGET pro- 
gram even after dividing it into small pieces. 

For the case of nj=5 we examine both programs, one 
with a single amplitude and the other with divided am- 
plitudes. We find that the multiple kernel program with 
divided amplitudes runs faster than the single kernel pro- 
gram for the whole amplitude by about 30%. This is prob- 
ably because the size of the 67 amplitude program is near 
the maximum capability of the present CUDA compiler. 
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Fig. 5. Ratio of event process times (CPU/GPU). 
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Fig. 4. Event process times for the GPU and CPU. 



6.3 Comparison of performance of GPU and CPU 

The ratios of event process times between the CPU and 
GPU are shown in Fig. \5\ The solid curves give the ra- 
tios between the permutation amplitudes on the CPU (the 
upper CPU points in Fig. 2]) and those on the GPU (the 
upper GPU points in Fig. [4]). The performance ratio is 
about 120 for — 2 and 3, and it goes down to about 
45 for n-y = 6, 7 and 8. A higher-performance ratio is 
found when we compare the event process time of the 
MadGraph-gencratcd amplitude on the CPU (the lower 
CPU points in Fig. g]) and that on the GPU (the lower 
GPU points in Fig. |4]). Ratios above 150 are found for 

— 3 and 4, going down to about 45 for = 7. For 
n-y = 6 and 7, the HEGET programs from the MadGraph- 
generated HELAS codes are too big for the present CUDA 
compiler, and the results are shown for the GPU program 
with multiple kernel calls for one event. Nevertheless, the 
performance ratio is about 60 for n-y = 6 and about 45 for 
n-y = 7. Even more surprisingly, the performance is bet- 
ter for the divided amplitude with multiple kernel calls 
than the single kernel computation for = 5, with the 
performance ratio exceeding 100. 

To summarize, we find that GPU programs run faster 
than CPU programs by a factor exceeding 100 for <5, 
while a performance gain larger than 40 is achieved for 
other < 8. 



6.4 Difference among CPUs 

Wc also examine the performance of two other CPUs, 
the 9800GTX and the 8800M GTS. The GPU board with 
9800GTX is connected to the same host PC as GTX280. 
On the other hand, the 8800M GTS is a built-in graphic 
card on the iMac. We ran the same programs for all CPUs 
and measures their event process times. Fig. [6] shows the 
process time per event for the permutation amplitudes 
on the 9800GTX and 8800M GTS relative to a GTX280, 
which was shown in Fig. 2] as "GPU permutation" . The 
ratio of process times among various CPU's is roughly 
proportional to the inverse of the number of multiproces- 
sors: 240 on the GTX280, 128 on the 9800GTX, 64 on 
the 8800M GTS; see Table [H It is worth noting that even 
the small built-in graphic card (8800M GTS) on an iMac 
can compute n-y production processes faster than CPU by 
about factors of 25, 15 and 10, respectively, for rij — S, 4 
and 5. 
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Fig. 6. Event process times for GPUs. 



6.5 Double precision calculation 

The CTX280 has one double precision processing unit for 
each streaming multiprocessor (SM). We prepared a dou- 
ble precision version of the HEGET library and the CUDA 
program which computes the cross sections with dou- 
ble precision. The structure of these programs is the same 
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as the single precision version described above. Results of 
the single and double precision computations agree. Fig. [7] 
shows the ratio of event process times between double and 
single precision computations. Because the number of dou- 
ble precision processing units is small compared to the 
number of single precision processing units, the program 
with the double precision computations runs 3-4 times 
slower than that with single precision. 



amplitudes. It is clearly seen from Fig. [8] that by reduc- 
ing the while loop the performance of the program with 
permutation improves significantly for > 5, approach- 
ing that of the MadGraph based programs. The presence 
of the while loop is probably the main cause of the poor 
performance of the permutation amplitudes as compared 
to the MadGraph generated ones, as shown in Fig. HI 
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Fig. 7. Ratio of event process times (double/single precision) 
on a GTX280. 
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Fig. 8. Effect of unrolling while loops in "permutation" am- 
plitudes on a GTX280. 



6.6 Effect of unrolling loops 

In the permutation amplitude program, a combination of 
integer numbers for successive permutations is generated 
sequentially within a while loop of the C-language: see 
[Appendix A.2| for n-y = 3 as an example. Since programs 
with loops or branches lower the efficiency of parallel com- 
puting on the GPU, in general, there is a possibility of 
improving the performance of the permutation program 
by spreading out, or unrolling, a part of the while loop in 
the permutation. 

We therefore tested the effect of unrolling the while 
loop for the permutation generation. Sample code for the 
case of the 3-photon production process is shown in 
[Appendix C| as List [iTl for the amplitude program with 
one permutation unrolled. It should be compared with the 
program List [2] in [Appendix A.2[ which performs 3! = 6 
permutation within one while loop. Fig. [8| shows the rela- 
tive performance of the programs after unrolling a part of 
the while loop as compared to the process time presented 
in Fig. m as "GPU permutation" . 

We find that the effect of unrolling is significant, es- 
pecially for large n.y processes. The improvement of the 
execution speed becomes a factor of 3 for n~f — l and 8. Be- 
cause the event process time does not change significantly 
by unrolling the while loop on the CPU, the CPU/GPU 
ratio plotted in Fig. [5] for 'permutation' amplitudes grows 
from ~ 45 to ^ 150 for ^7 and 8 after unrolling two 
permutations. 

Also plotted as a reference in Fig. [8| are the relative 
performances of the computation with the converted Mad- 
Graph amplitudes as compared to those of permutation 



7 Summary 

We have presented the results of our attempt to compute 
multi-particle production events at hadron colliders on 
GPUs Ij, Graphic Processing Units. Our achievements 
and findings may be summarized as follows. 

— HELAS subroutines [3] written in FORTRAN were 
converted to HEGET functions written in CUDA [2], 
a C-language developed for GPU computing. 

— The HELAS amplitude code for uu n^'s {uj < 7) 
generated by MadGraph [4] was converted to a CUDA 
program which calls HEGET functions. 

— CUDA programs that compute (n !— 1) permutations of 
a single Feynman amplitude were also made for nj<8. 

— Event process times of the GPU program on a GTX280 
are more than 100 times faster than the CPU program 
for n-y <5, while the gain is reduced to about 45 for 
nj = 7 and 8. 

— The present CUDA compiler cannot process the 
HEGET programs converted from the MadGraph gen- 
erated HELAS codes for > 6. 

— For nj = 5, 6 and 7, we were successful in running the 
converted MadGraph programs by dividing the whole 
amplitude into smaller pieces and calling them sequen- 
tially on the CPU. Performance improves by this divid- 
ing procedure for n.y = 5, where the whole amplitude 
can be computed by one kernel call. 

— CUDA programs based on permutations of one HEGET 
amplitude are about a factor of 3 to 4 slower than 
those based on divided MadGraph codes for n-^ > 5. 
The main cause of this slowdown was identified as the 
while loop in the program that generates permutation 
of n-y-integers. 
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The same programs run on the GTX280, 9800GTX 
and 8800M GTS, and the performance is roughly pro- 
portional to the number of processors: 240, 128 and 

64, respectively. 

Double precision amplitudes on a GTX280 were ex- 
amined, finding a factor of 2.5 to 4 slower performance 
compared to the single precision computation. 
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vxxxx0(p5, nli5, +i, w[4]); 

cmplx wOl [6] ,w02 [6] ; 

cmplx ampsum = mkcmplxCO.Of , O.Of); 

int Flag_Perm=l; 
int a [3] ; 
a[0] =2; 



a[l] 
a [2] 



do { 

cmplx amp; 

fvoxxO(w[i], w[a[o]], gau, wOl); 
fvoxxO(w01 , w[a[i]]. gau, w02); 
iovxxO(w[o] , w02, w[a[2]], gau, amp); 

ampsum += amp; 
iPNextCa, 3, Flag_Perm) ; 
} while (Flag_Perm>o) ; 



Appendix A Sample codes for amplitude 
calculations 

Appendix A.l Three photon production amplitude 
with the converted MadGraph code 

List 1. uux3a(madgraph).cu 
cmplx w01[6], w02[6], w03[6], w04[6], w05[6]; 

ixxxxKpl, nhl, +1, wOl) 

oxxxx2(p2, nh2, -l, w02) 

vxxxx0(p3, nh3, +1, w03) 

vxxxxO(p4, nh4, +i, w04) 

vxxxx0(p5, nli5, +i, w05) 



cmplx w06 [6] , w07 [6] , w08 [6] ; 

cmplx ampsum = mkcmplx(o.of , 
cmplx amp; 



gau,w06) ; 
gau,w07) ; 
w05,gau,amp) ; 

gau,w07) ; 
gau,w08) ; 
w03 , gau , amp) ; 

gau,w06) ; 
gau,w07) ; 
w05,gau,amp) ; 

gau,w06) ; 
gau,w07) ; 
w03,gau,amp) ; 

gau,w07) ; 
,gau,w08) ; 
,w05,gau,amp) ; 

gau,w07) ; 
gau,w06) ; 
,w05,gau,amp) ; 



O.Of); 



f voxxO 
f voxxO 
iovxxx 
ampsum 
f vixxO 
f voxxO 
iovxxx 
ampsum 
f voxxO 
f vixxO 
iovxxx 
ampsum 
f voxxO 
f vixxO 
iovxxx 
ampsum 
f vixxO 
f vixxO 
iovxxx 
ampsum 
f vixxO 
f voxxO 
iovxxx 
ampsiuu 



(w02,w03 
(w06,w04 
(w01,w07 
amp; 
(w01,w04 
(w02,w05 
(w07,w08 
+= amp; 
(w02,w03 
(w01,w04 
(w07,w06 
amp; 
(w02,w04 
(w01,w05 
(w07,w06 
amp; 
(w01,w03 
(w07,w04 
(w08,w02 
+= amp; 
(w01,w03 
(w02,w04 
(w07,w06 
+= amp; 



Appendix A. 2 Three photon production amplitude 
with permutations 



List 2. uux3a(permutation).cu 



cmplx W [5] [6] ; 

ixxxxKpl, nhl, +1, w[o]) 

oxxxx2(p2, nh2, -l, w[l]) 

vxxxx0(p3, nh3, +1, w[2]) 

vxxxx0(p4, nh4, +1, w[3]) 



Appendix B HEGET functions 
Appendix B.l Fermion wave functions 



List 3. ixxxxl.cu 



#include " cmplx. h" 
device 

void ixxxxKfloat* p, int nHEL, int nSF, cmplx* fi) 

float SQP0P3 = sqrtf (p[0]+p[3])*(float)(nSF); 
int NH = rLHEL*nSF; 

fi[4] = mkcmplx(p[0]* (float )(nSF), 

p[3]»(float)(nSF)); 
fi[5] = mkcmplx(p[l]*(float)(nSF), 

p[2]*(float)(nSF)); 

cmplx CHI = mkcmplx(NH*p[i]*(i.of/SQPOP3), 

p[2]*(i.of/SQPOP3)); 
cmplx CZERO = mkcmplxco.of , o.of); 
cmplx CSQP0P3 = mkcmplx(SQP0P3, o.of); 

fi[0] = (NH== l)*CZERO + (NH==-l)*CHI; 
fi[l] = (NH== l)*CZERO + (NH==-l)*CSQP0P3; 
fi[2] = (NH== l)*CSQP0P3 + (NH==-l)*CZERO; 
fi[3] = (NH== l)*CHI + (NH==-l)*CZERO; 

return; 



List 4. ixxxx2.cu 



#include " cmplx. h" 
device 

void ixxxx2(f loat* p, int nHEL, int nSF, cmplx* fi) 
int NH=nHEL*nSF; 

fi[4] = mkcmplx(p[o]*(float)(nSF), 

p[3]*(float)(nSF)); 
fi[5] = mkcmplx(p[l]*(float)(nSF), 

p[2]*(float)(nSF)); 

cmplx CHI = mkcmplxc nHEL*sqrtf (2.of *p[o]) , 
o.of); 

cmplx CZERO = mkcmplxco.of , o.of ); 

fi[0] = (NH== l)*CZERO + (NH — l)*CHI; 
f i[i]=CZERO; 
f i[2]=CZER0; 

fi[3] = (NH== l)*CHI + (NH — l)*CZERO; 
return; 



List 5. ixxxxO.cu 

#include " cmplx. h" 
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device 

void ixxxxOCf loat* p, int nHEL, int nSF, cmplx* fi) 

float SQPOPS = sqrtf (p[0]+p[3])*(float)(nSF); 
int NH = nHEL*nSF; 

fi[4] = mkcmplx(p[0]*(float)(nSF), 

p[3]*(float)(nSF)); 
fi[5] = mkcmplx(p[i]*(float)(nSF), 

p[2]*(float)(nSF)); 

cmplx CHI = mkcmplxc NH*p[i]*(i.of /SQP0P3) , 

(1.0f/SQP0P3)*p[2]); 

cmplx CZERO = mkcmplx(o.of ,o.of ); 
cmplx CSQP0P3 = mkcmplx(SQP0P3, o.of); 

fi[0] = (NH== l)*CZERO + (NH — l)*CHI; 
fi[l] = (NH== l)*CZERO + (NH — l)*CSQP0P3; 
fi[2] = (NH== l)*CSQP0P3 + (NH — l)*CZERO; 
fi[3] = (NH== l)*CHI + (NH — l)*CZERO; 

return; 



int NH = nHEL*nSF; 



fo[4] 



mkcmplx(p[o] « (float) (nSF) , 
p[3]»(float)(nSF)); 
fo[5] = mkcmplx(p[l]*(float)(nSF), 
p[2]*(float)(nSF)); 

cmplx CHI = mkcmplx( NH*p[i] t(i.of /SQP0P3) , 

-p[2]*(i.of/SQPOP3)); 
cmplx CZERO = mkcmplx(o.of , o.of); 
cmplx CSQP0P3 = mkcmplx(SQP0P3, o.of); 

fo[0] = (NH== l)*CSQP0P3 + (NH==-l)*CZERO; 
fo[l] = (NH== l)»CHI + (NH==-l)*CZERO; 
fo[2] = (NH== l)*CZERO + (NH==-l)*CHI; 
fo[3] = (NH== l)*CZERO + (NH — l)*CSQP0P3; 

return; 



Appendix B.2 Vector boson wave functions 



List 6. oxxxxl.cu 



#include " cmplx. h" 
device 

void oxxxxKfloat* p, int nHEL, int nSF, cmplx* fo) 

float SqP0P3=sqrtf (p [o] +p [3] ) * (float) (nSF) ; 
int NH=nHEL*nSF; 

fo[4] = mkcmplx(p[0]*(float)(nSF), 

p[3]*(float)(nSF)); 
fo[5] = mkcmplx(p[i]*(float)(nSF), 

p[2]*(float)(nSF)); 

cmplx CHI = mkcmplx( NH*p[l]*(l.of /SQP0P3) , 

-p[2]*(i.of/SQPOP3)); 
cmplx CZERO = mkcmplx(o.of , o.of); 
cmplx CSQP0P3 = mkcmplx(SQP0P3, o.of); 

fo[0] = (NH== l)»CSQP0P3 + (NH==-l)*CZERO; 
fo[l] = (NH== l)*CHI + (NH==-l)*CZERO; 
fo[2] = (NH== l)*CZERO + (NH==-l)*CHI; 
fo[3] = (NH== l)*CZERO + (NH — l)*CSQP0P3; 

return; 



List 7. oxxxx2.cu 

#include " cmplx. h" 
device 

void oxxxx2(f loat* p, int nHEL, int nSF, cmplx* fo) 
int NH=nHEL*nSF; 

fo[4] = mkcmplx(p[o]*(float)(nSF), 

p[3]*(float)(nSF)); 
fo[5] = mkcmplx(p[l]*(float)(nSF), 

p[2]*(float)(nSF)); 

cmplx CHI = mkcmplx(-nHEL*sqrtf (2.of *p[o] ) , 
o.of); 

cmplx CZERO = mkcmplx (o.of, o.of ); 
fo[o] =CZERO; 

fo[l] = (NH== l)*CHI + (NH==-l)*CZERO; 
fo[2] = (NH== l)*CZERO + (NH==-l)*CHI; 
fo[3]=CZER0; 

return; 



List 8. oxxxxO.cu 

#include " cmplx. h" 
device 

void oxxxxO(f loat* p, int nHEL, int nSF, cmplx* fo) 
float SQP0P3 = sqrtf (p[o]+p[3])*(float)(nSF); 



List 9. vxxxxl.cu 



#include " cmplx. h" 
device 

void vxxxxl (float* p, int nHEL, int nSV, cmplx* vc) 

vc[4] = mkcmplx (-p[o] , -p[3]); 
vc[5] = mkcmplx (-p[l] , -p[2]); 



vc [0] 
vc[3] 



mk cmplx (0. of , o.of); 
mkcmplx (o.of, o.of); 



vc[l] = mkcmplx (-(float) (nHEL) *rSQRT2, o.of); 
vc[2] = mkcmplx ( o.of, -rSQRT2); 

return; 



List 10. vxxxx2.cu 

#include " cmplx. h" 
device 

void vxxxx2(f loat* p, int nHEL, int nSV, cmplx* vc) 

vc[4] = mkcmplx (-p[o] , -p[3]); 
vc[5] = mkcmplx (-p[l] , -p[2]); 

vc[0] = mkcmplx (0. of , O.of); 
vc[3] = mkcmplx (o.of , o.of); 



vc[i] 

vc [2] 

return; 



mkcmplx(-(float)(nHEL)*rSQRT2, o.of); 
mkcmplx ( o.of, rSQRT2); 



List 11. vxxxxO.cu 



#include "cmplx. h" 
device 

void vxxxxO(f loat* p, int nHEL, int nSV, cmplx* vc) 

vc[4] = mkcmplx (p[o] , p[3])*nSV; 
vc[5] = mkcmplx (p[i] , p[2])*nSV; 

float rpt = rsqrtf (p[l]*p[l] + p[2]*p[2]); 

vc[o] = mkcmplx (o.of , o.of); 
vc [3] = mkcmplx ( 

(float) (nHEL) *(l. of /(rpt*p[0]))*rSQRT2, 

O.of); 

float pzpt = (p[3]*(i.of/p[o])*rpt)*rSQRT2 
♦ (float) (nHEL); 

vc[l] = mkcmplx (-p[l]*pzpt, 

-nSV*p[2] * rpt * rSQRT2); 
vc[2] = mkcmplx (-p [2] *pzpt. 
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+nSV*p[i] * rpt * rSQRT2); 

return; 

> 



Appendix B.3 Fermion-fermion-vector boson vertex 



List 12. iovxxx.cu 



#include "cmplx.h" 
device 

void iovxxOCcmplx* fi, cmplx* fo, cmplx* vc, 
float* gal, cmplxst vertex) 

vertex = 

gal [0] * ( (f o [2] »f i [0] +f o [3] *f i [1] ) *vc [o] 
+ (f o [2] *f i [i]+f o [3] *f i [0] )*vc [1] 

- ( (f O [2] *f i [1] -f O [3] *f i [0] ) *VC [2] ) 
*mk cmplx (0. of , l.of) 
+ (f o [2] *f i [0] -f o [3] *f i [1] ) *vc [3] ) 
+gal [1] * ( (f o [0] *f i [2] +f o [1] *f i [3] ) * vc [0] 
-(f o [0] *f i [3] +f o [1] *f i [2] ) *vc [1] 

+ ((f O [0] *f i [3] -f O [1] *f i [2] )*VC [2] ) 

*mk cmplx (0. of , l.of) 

- (f o [0] *f i [2] -f o [1] *f i [3] ) *vc [3] ) ; 

return; 

} 



List 13. fvixxO.cu 



#include "cmplx.h" 
device 

void fvixxO (cmplx* fi, cmplx* vc, float* gal, 
cmplx* fvi) 

fvi[4] = f i[4]-VC[4] ; 
fvi [5] = f i[5]-VC[5] ; 

float pf [4] ; 

pf [0] = fvi [4]. re; 
pf [1] = fvi [5] .re; 
pf [2] = fvi[5].im; 
pf [3] = fvi[4] .im; 

float pf2 = pf [o]*pf [0] - 

(pf [1] *pf [1] + pf [2] *pf [2] + pf [3] *pf [3] ) ; 
cmplx cl = mkcmplx( o.of, l.of); 
float d = -i.of/pf2; 

cmplx sll = (vc[o] + vc[3])*fi[o] 

+ (VC[1] - cI*VC[2])*f i[l] ; 

cmplx sl2 = (vc[o] - vc[3])*fi[l] 

+ (VC[l] + cI*VC[2])*fi[0] ; 

cmplx srl = (VC [0] - vc[3])*fi[2] 

- (VC[1] - Cl*VC[2])*f i[3] ; 

cmplx sr2 = (vc[o] + vc[3])*fi[3] 

- (vc[l] + cl*vc[2])*f i[2] ; 

fvi[o] = ( gal[o]*((pf [o]-pf [3])*sll 

- conj (fvi[5])*sl2))*d; 
fvi[i] = ( gal[o]*((pf [o]+pf [3])*sl2 

- fvi[5] *sll))*d; 

fvi [2] = ( gal[i]*((pf [o]+pf [3])*srl 

+ coni (fvi[5])*sr2))*d; 

fvi [3] = ( gal[i]*((pf [o]-pf [3])*sr2 
+ fvi[5] *srl))*d; 

return; 

} 



List 14. fvoxxO.cu 



#include "cmplx.h" 
device 

void f voxxO(cmplx* fo, cmplx* vc, float* gal, 
cmplx* fvo) 

fV0[4] = fo[4]+VC[4]; 
fvo [5] = f 0[5]+VC[5] ; 



float pf [4] ; 

pf [0] = fvo [4] .re; 
pf [1] = fvo [5] .re; 
pf [2] = fvo [5] .im; 

pf [3] = fvo [4] .im; 

float pf2 = pf [0]*pf [0] 

- (pf [l]*pf [1] + pf [2]*pf [2] + pf [3]*pf [3]); 
cmplx cl = mkcmplx( o.of, l.of); 

float d = -i.of/pf2; 

cmplx sll = (vc[o] + vc[3])*fo[2] 

+ (VC[1] + Cl*VC[2])*f 0[3] ; 

cmplx sl2 = (vc[o] - vc[3])*fo[3] 

+ (VC[1] - Cl*VC[2])*fO[2] ; 

cmplx srl = (vc[o] - vc[3])*fo[o] 

- (VC[1] + cI*VC[2])*f 0[1] ; 

cmplx sr2 = (vc[o] + vc[3])*fo[l] 

- (VC[1] - cI*VC[2])*f O[0] ; 

fvo[0] = ( gal[l]*((pf [0]+pf [3])*srl 
+ fvo[s] *sr2 ))*d; 

fvo[i] = ( gal[i]*((pf [o]-pf [3])*sr2 

+ conj (fvo[5])*srl ))*d; 

fvo[2] = ( gal[0]*((pf [0]-pf [3])*sll 

- fvo[s] *sl2 ))*d; 
fvo [3] = ( gal[0]*((pf [0]+pf [3])*sl2 

- conj (fvo[5])*sll ))*d; 

return; 

} 



List 15. jioxxO.cu 



#include "cmplx.h" 
device 

void jioxxO(cmplx* fi, cmplx* fo, float* gal, 
cmplx* jio) 

ji0[4]=f 0[4]-f i[4] ; 
ji0[B]=f 0[B]-f i[B] ; 

float DD=i.of/( 

(jio[4] .re)*(jio[4] .re) - (jio[5] .re)*(jio[5] .re) 
- (jio[5] .im)*(jio[5] .im) - (jio[4] .im)*(jio[4] .im)) ; 

j io [0] = (gal [0] * (f o [2] *f i [0] +f o [3] *f i [1] ) 

+gal [1] * (f o [0] *f i [2] +f o [1] *f i [3] ) ) *DD ; 

j io [1] =(gal [1] *(f o [0] *f i [3]+f o [1] *f i [2] ) 

-gal [0] * (f o [2] *f i [1] +f o [3] *f i [0] ) ) *DD ; 

j io [2] = (gal [0] * (f o [2] *f i [1] -f o [3] *f i [0] ) 
-gal [1] * (f o [0] *f i [3] -f o [1] *f i [2] ) ) 
*mkcmplx(o.of ,DD) ; 

j io [3] =(gal [1] *(f o [0] *f i [2] -f o [i] *f i [3] ) 

-gal [0] * (f o [2] *f i [0] -f o [3] *f i [1] ) ) *DD ; 

return; 

} 



Appendix B.4 Complex numbers 



List 16. cmplx.h 



#ifndef __ciiipix_h__ 
#def ine cmplx.h 

typedef struct align (8){ 

float re; 

float im; 
} cmplx; 

inline host device 

cmplx mkcmplx (float re, float im){ 

cmplx Z; 

z.re=re; 

z.im=im; 

return z; 

} 

inline host device 

cmplx mkcmplx(cmplx z){ 
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return mkcmplx(z.re,z.im); 

inline host device 

cmplx mkcmplx (float a){ 
return mkcmplx (a, o. o) ; 

inline host device 

float real (cmplx a){ 
return a.re; 

} 

inline host device 

float imag(cmplx a){ 
return a.im; 

} 

inline host device 

cmplx conj (cmplx a){ 

return mkcmplx(a.re,-a.im) ; 

inline host device 

cmplx operator* (cmplx a, cmplx b){ 

return mkcmplx(a.re + b.re, a.im + b.im); 

inline host device 

void operator+=(cmplx &a, cmplx b){ 



} 



a.re += b.re; a.im += b.im; 



b . im) ; 



inline host device 

cmplx operator+(cmplx a){ 

return mkcmplx(+a.re, +a.im); 

inline host device 

cmplx operator-(cmplx a, cmplx b){ 
return mkcmplx(a.re - b.re, a.im 

inline host device 

void operator-=(cmplx &a, cmplx b){ 
a.re -= b.re; a.im -= b.im; 

> 

inline host device 

cmplx operator-(cmplx a){ 

return mkcmplx(-a.re, -a.im); 

inline host device 

cmplx operator*(cmplx a, cmplx b)< 

return mkcmplx((a.re * b.re) - (a.im * b.im), 
(a.re * b.im) + (a.im * b.re)); 

} 

inline host device 

cmplx operator*(cmplx a, float s){ 
return mkcmplx(a.re * s, a.im * s); 

inline host device 

cmplx operator*(f loat s, cmplx a){ 
return mkcmplx(a.re * s, a.im * s); 

inline host device 

void operator*=(cmplx &a, float s){ 
a.re *= s; a.im *= s; 

> 

inline host device 

cmplx operator/(cmplx a, cmplx b){ 

float tmpD=(i.of /(b.re*b.re+b.im*b.im)) ; 
return mkcmplx ( 

( (a.re * b.re) + (a.im * b.im))*tmpD, 
(-(a.re * b.im) + (a.im * b.re))*tmpD 



inline host device 

cmplx operator/ (cmplx a, float s){ 

float inv = i.of / s; 

return a * inv; 

> 

inline host device 

cmplx operator/ (float s, cmplx a){ 

float inv = s*(l.of/(a.re*a.re+a.im*a.im)); 

return mkcmplx(inv*a.re,-inv*a.im) ; 



inline host device 

void operator/=(cmplx &a, float s){ 

float inv = i.of / s; 

a *= inv; 

> 

inline host device 

float fabsc (cmplx a){ 

return sqrtf ((a.re*a.re)+(a.im*a.im)); 

inline host device 

float fabs2c (cmplx a){ 

return (a.re*a.re)+(a.im*a.im); 

} 

#endif 



Appendix C A sample code for unrolling the 
permutation loop for three photon 
production process 

List 17. uux3a(unrolled).cu 
cmplx w [5] [6] ; 

ixxxxKpl, nhl, +1, w[o]) 

oxxxx2(p2, nh2, -l, w[l]) 

vxxxx0(p3, nh3, +1, w[2]) 

vxxxxO(p4, nh4, +1, w[3]) 

vxxxx0(p5, nh5, +1, w[4]) 

cmplx w01[6],w02[6]; 

cmplx ampsum = mkcmplx(o.of , o.of); 

int Flag_Perm=i; 
int a [3] ; 
a[o] = 2; 



a[i] 
a [2] 



do { 

cmplx amp; 

fvoxxO(w[l], w[a[o]], gau, 
fvoxx0(w09 , w[a[i]], gau, 
iovxxO(w[o], wlO, w[a[2]], 
ampsum += amp; 
fvoxxO(w[i], w[a[o]], gau, 
fvoxx0(w09 , w[a[2]], gau, 
iovxxO(w[o], wlO, w[a[l]], 
ampsum += amp; 
fvoxxO(w[l], w[a[2]], gau, 
fvoxx0(w09 , w[a[o]], gau, 
iovxxO(w[o], wlO, w[a[i]], 
ampsum += amp; 
iPNext(a, 2, Flag_Perm); 
} while (Flag_Perm>o) ; 



w09); 
wlO); 

gau, amp); 

w09); 
wlO); 

gau, amp); 

w09); 
wlO); 

gau, amp); 
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